|  |  |
| --- | --- |
| **Name:** | *John Li* |
| **NetID:** | *Johnwl2* |
| **Section:** | *AB* |

**ECE 408/CS483 Milestone 3 Report**

|  |
| --- |
| 1. List Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images from your basic forward convolution kernel in milestone 2. This will act as your baseline for this milestone. |
| |  |  |  |  |  | | --- | --- | --- | --- | --- | | Batch Size | Op Time 1 | Op Time 2 | Total Execution Time | Accuracy | | 100 | *0.178091 ms* | *0.636865 ms* | *0m1.730s* | *0.86* | | 1000 | *2.21351 ms* | *8.81894 ms* | *0m11.151s* | *0.886* | | 10000 | *21.6647 ms* | *86.24 ms* | *1m53.180s* | *0.8714* |   Nsys profiling results for batch size 100 below: |
| 1. **Optimization 1: *Sweeping various parameters to find best values*** |
| * 1. Which optimization did you choose to implement and why did you choose that optimization technique?   My first optimization to kick off from the baseline implementation is to sweep various parameters to find the most optimal values. I chose this optimization because I was curious as to how simply adjusting the values and sizes of certain variables and macros would impact the performance and efficiency of my convolution forward kernel. |
| * 1. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations?  The optimization works by trying various sizes or values for the parameters in my forward convolution kernel (in this case, I found my TILE\_WIDTH to have the most effective influence) and then determining which value is the best. I found TILE\_WIDTH 12 to be the best. I think this optimization increases the performance of the forward convolution because a TILE\_WIDTH of 12 is not too large and not too small to use for tiling. (My original TILE\_WIDTH for the baseline was size 32.) Other values for TILE\_WIDTH I tried were 4, 64 (TILE\_WIDTH 64 didn’t even give the correct accuracy), and 16, but their resulting performances weren’t as strong compared to the performance when TILE\_WIDTH = 12. This was my first optimization, so there were no optimizations prior to it: just the baseline before. |
| * 1. List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used).  |  |  |  |  |  | | --- | --- | --- | --- | --- | | Batch Size | Op Time 1 | Op Time 2 | Total Execution Time | Accuracy | | 100 | *0.205364ms* | *0.550425 ms* | *0m1.220s* | *0.86* | | 1000 | *1.96903 ms* | *5.47458 ms* | *0m10.966s* | *0.886* | | 10000 | *19.613 ms* | *54.4653 ms* | *1m46.744s* | *0.8714* |  * 1. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your answer, directly comparing to your baseline (or the previous optimization this one is built off of).  Yes, it was successful in improving performance. This is most likely because a tile size of 12 is not too large yet not too small for a kernel to perform forward convolution, so it ends up being the most ideal size. An alternative way to phrase this is that generically speaking, every dataset has an ideal size. And in my particular situation, the ideal tile size/width was 12.  Nsys profiling results for batch size 100 below: |
| As you can see from the nsys profiling, the average time it took for the forward convolution kernel implementation is 369,020.5 nanoseconds which is faster than the baseline forward convolution kernel average time of 597,437 nanoseconds. The same trend is present for other time statistics like the total time, minimum time, etc. in the profiling. Furthermore, the OP times and total execution times under this implementation are faster than those of the baseline. For example the OP times for this optimization for a batch size of 10000 was ~19.6 ms and ~54.5 ms respectively, while the baseline implementation had times of 21.67 ms and 86.24 ms for the same batch size.   * 1. What references did you use when implementing this technique?   None. |
| 1. **Optimization 2: *Sweeping various parameters to find best values +* Tuning with restrict and loop unrolling because restricting** |
| 1. Which optimization did you choose to implement and why did you choose that optimization technique? |
| *I chose to stack my previous optimization with the* “*Tuning with restrict and loop unrolling” because I believed tuning certain parameters with the restrict keyword and unrolling the loops in my convolution forward kernel manually would be faster than structuring it as regular for loops. Furthermore, this optimization is simple to implement and understand.* |
| 1. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations?   First you tweak certain parameters passed into the convolution forward kernel function with the restrict keyword. In particular, you change const float \* x and const float \* k to const float \* \_\_restrict\_\_ x and const float \* \_\_restrict\_\_ k. Then you need to take the double for loop which originally looped from 0 to K twice (because it was a x2 for loop) and “unroll” it or manually perform the floating-point add computation with the actual numbers 0-6 (7 times because kernel size is 7x7). I believe this optimization would increase the performance of the forward convolution because having the kernel go through a loop is slightly less efficient than calculating the computation directly multiple times. Furthermore, I believe using the \_\_restrict\_\_ keyword will significantly help the compiler optimize code because it doesn’t have to worry that a write to a restrict pointer will cause a value read from another restrict pointer to change. Also, this optimization synergizes with the previous “sweeping various parameters to find best values” optimization; I stacked this current optimization on top of the previous one. |
| 1. List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used). |
| |  |  |  |  |  | | --- | --- | --- | --- | --- | | Batch Size | Op Time 1 | Op Time 2 | Total Execution Time | Accuracy | | 100 | *0.197431 ms* | *0.536113 ms* | *0m1.168s* | *0.86* | | 1000 | *1.98221 ms* | *5.48225 ms* | *0m10.395s* | *0.886* | | 10000 | *20.3537 ms* | *54.8231 ms* | *1m44.959s* | *0.8714* | |
| 1. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your answer, directly comparing to your baseline (or the previous optimization this one is built off of).   This optimization was slightly successful in improving performance. First, the \_\_restrict\_\_ keyword makes the compiler not have to worry that a write to a restrict pointer will cause a value read from another restrict pointer to change. As for the unrolling aspect, it is useful because it eliminates the conditional check and branch portion of a loop, which can be expensive. It also allows the compiler to pre-compute what the indices and values are and insert them in place where the loop counter would be. The biggest setback is that it can bloat code size which can affect memory usage. Since I practically copy-pasted the same line/computation 49 times, that appears to bloat my code size which may offset some of the headway I’ve been making in regards to performance improvements. Thus, the overall resulting performance is only slightly more successful as compared to before.  Nsys profiling results for batch size 100 below:  From the Nsys profiling results above, unrolling + leveraging *\_\_restrict\_\_* cut down slightly on overall times compared to the previous optimization where I sweeped various parameters to find the most optimal values (i.e. the TILE\_WIDTH being 12). For example, the average time is 356,477 nanoseconds and the total time is 712954 ns for this optimization, which is slightly faster than the total and average times of the prior optimization which were 738041 ns for total time and 369020 ns for average time. Furthermore, the OP times and total execution times under this implementation are, in general, slightly faster or practically the same than those of the previous implementation. For instance, for a batch size of 100, the respective OP times were 0.197431 ms and 0.536113 ms, while the previous optimization’s OP times (without this current one) for the same batch size were 0.205364 ms and 0.550425 ms for the same batch size. |
| 1. What references did you use when implementing this technique?   [https://developer.nvidia.com/blog/cuda-pro-tip-optimize-pointer-aliasing/](about:blank) |

|  |
| --- |
| 1. **Optimization 3: *Weight Matrix (kernel values) in constant memory + Tuning with restrict keywords + loop unrolling + Sweeping various parameters to find best values***   ***(Delete this section blank if you did not implement this many optimizations.)*** |
| * 1. Which optimization did you choose to implement and why did you choose that optimization technique? |
| *Weight matrix (kernel values) in constant memory and the baseline GPU convolution forward kernel (from PM2). I chose to implement this particular optimization technique because I believed using (extra) constant memory to perform the primary computations for the GPU convolution forward kernel (PM2) would make the computations as a whole faster.* |
| * 1. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations? |
| *First you declare a block of constant memory globally with size 3136. 3136 comes from kernel dimension \* kernel dimension \* sizeof(float) \* C (# of channels) = 7 \* 7 \* 4 \* 16. Then you insert or replace the block of constant memory into the forward convolution algorithm/implementation. As long as threads in the same warp access the same address, leveraging constant memory is a very fast and effective solution. In my case, all the threads I am using in each warp access the same address. This optimization synergizes with “tuning with restrict and loop unrolling” and “sweeping various parameters to find best values”; in fact, I stacked this optimization on top of those previous ones.* |
| * 1. List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used). |
| |  |  |  |  |  | | --- | --- | --- | --- | --- | | Batch Size | Op Time 1 | Op Time 2 | Total Execution Time | Accuracy | | 100 | *0.143663 ms* | *0.389186 ms* | *1.143 s* | *0.86* | | 1000 | *1.69953 ms* | *3.93012 ms* | *10.189 s* | *0.886* | | 10000 | *16.8401 ms* | *39.1409 ms* | *1 min 42.878 s* | *0.8714* | |
| * 1. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your answer, directly comparing to your baseline (or the previous optimization this one is built off of).  Yes, this optimization was successful in improving performance because as long as threads in the same warp access the same address (which is precisely what occurred in my program), leveraging constant memory becomes a very fast and effective solution.  Nsys profiling results for batch size 100 below:     As one can see from the Nsys profiling results above, the average time is 256655.5 ns and the total time is 513309 ns. This is significantly better than the previous runtimes, which were 738041 ns for total time and 369020 ns for average time. The same trend can be extrapolated toward the other time metrics (min time, etc.) as well. Moreover, the OP times and total execution times under this implementation are faster than those of the previous implementation (first and second OP times were 16.8401 ms and 39.1409 ms for a batch size of 10000). Leveraging constant memory, along with having a TILE\_WIDTH = 12 (as a result of sweeping for the most optimal parameters) and tuning with \_\_restrict\_\_ and loop unrolling, appears to be my best optimization yet.   * 1. What references did you use when implementing this technique?   [http://cuda-programming.blogspot.com/2013/01/what-is-constant-memory-in-cuda.html](about:blank) |
| 1. **Optimization 4: Tiled shared memory convolution + sweeping various parameters to find the best values.** |
| * 1. Which optimization did you choose to implement and why did you choose that optimization technique? |
| *I chose to do tiled shared memory convolution because I believed utilizing shared memory and more advanced tiling methods would cut down on overall runtimes since the fundamental forward convolution kernel don’t use such strategies to their advantage.* |
| * 1. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations? |
| *At its core, this optimization uses shared memory to assist with forward convolution, but the general forward convolution algorithm would remain the same (so same idea as baseline implementation). So I think this optimization would enhance performance because it makes use of extra (shared) memory. Also, I synergized this optimization with the first optimization: sweeping various parameters to find the best values.* |
| * 1. List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used). |
| |  |  |  |  |  | | --- | --- | --- | --- | --- | | Batch Size | Op Time 1 | Op Time 2 | Total Execution Time | Accuracy | | 100 | *0.246051 ms* | *0.645248 ms* | *0m1.194s* | *0.86* | | 1000 | *2.34191 ms* | *6.34975 ms* | *10.134s* | *0.886* | | 10000 | *23.1343 ms* | *63.5048 ms* | *1m39.583s* | *0.8714* | |
| * 1. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your answer, directly comparing to your baseline (or the previous optimization this one is built off of).   *This optimization was actually not successful in improving performance. After doing some research on the internet, I learned that if I am using the data once and there is no data reuse between various threads within a block, using shared memory may actually take more time than global memory. And the reason behind this is when you copy data from global memory to shared, it still classifies as a global transaction, which takes time. Reads are faster when accessing shared memory, but it doesn’t really matter if you already had to read the memory once from global memory.*  Nsys profiling results with batch size 100:  As you can see from the Nsys profiling results (average time, total time, etc.) and the OP times/total execution times from the table above, the time performances either didn’t really change much or became worse than the time performances prior to utilizing tiled shared memory convolution. For instance, the OP times for a batch size of 10000 for this implementation were 23.1343 ms and 63.5048 ms, but without this optimization, the OP times for the same batch size were 19.613 ms and 54.4653 ms. |
| * 1. What references did you use when implementing this technique?   *-Kirk and Hwu Chapter 16,*  *-*[*https://stackoverflow.com/questions/10250325/cuda-shared-memory-not-faster-than-global*](about:blank) |
| 1. **Optimization 5: Fixed point (FP16) arithmetic + sweeping various parameters to find the best values**   ***(Delete this section if you did not implement this many optimizations.)*** |
| * 1. Which optimization did you choose to implement and why did you choose that optimization technique? |
| *I chose to do FP16 or fixed point arithmetic because floats are 32 bits long, and you don’t necessarily need 32 bits to represent every single data value. I believed truncating down to 16 bits (in CUDA, this is called a “half” data type) would help cut down on extra time. Furthermore, I wanted to gain some more experience handling different data types (although this reason is not specifically related to increasing time performances).* |
| * 1. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations? |
| *Essentially, there are several float values and floating point operations throughout the main forward kernel. Instead of using floats, convert to \_\_halfs, and then keep everything else the same. I believed truncating down to 16 bits/converting to \_\_half data type would help cut down on extra time and increase the performance of my forward convolution because I wouldn’t have to deal with potentially extraneous bits when using floats (which has 2x the number of bits as halfs). I did synergize this optimization with my first optimization: Sweeping various parameters to find the best values (i.e. TILE\_WIDTH = 12).* |
| * 1. List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used). |
| |  |  |  |  |  | | --- | --- | --- | --- | --- | | Batch Size | Op Time 1 | Op Time 2 | Total Execution Time | Accuracy | | 100 | *0.210636 ms* | *0.593005 ms* | *1.149s* | *0.86* | | 1000 | *2.11955 ms* | *6.01826 ms* | *10.175s* | *0.886* | | 10000 | *21.1560 ms* | *59.9124 ms* | *1m40.506s* | *0.8714* | |
| * 1. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your answer, directly comparing to your baseline (or the previous optimization this one is built off of).   This optimization did not appear to be very successful in improving performance. In fact, it appeared to have made the forward convolution slightly slower. The most likely cause behind this is that I had to use extra function calls, \_\_float2half(), to convert some of the float values into halves, which can cost some extra overhead time. Furthermore, I can also deduce that converting from float to a different data type does not seem to change anything because the actual values, or methods of computation never changed, so the performance times should not change very drastically because of that factor.  Nsys profiling results with batch size 100:    As you can see from the Nsys profiling results (average time, total time, etc.) and the OP times/total execution times from the table above, the time performances either didn’t really change much or became worse than the time performances prior to FP16 arithmetic. For instance, the OP times for a batch size of 10000 for this implementation were *21.1560 ms* and *59.9124 ms*, but without this optimization, the OP times for the same batch size were 19.613 ms and 54.4653 ms. |
| * 1. What references did you use when implementing this technique? |
| *-*[*https://docs.nvidia.com/cuda/cuda-math-api/group\_\_CUDA\_\_MATH\_\_\_\_HALF\_\_MISC.html*](about:blank) |
|  |
|  |
|  |
|  |
|  |
|  |
|  |
|  |
|  |
|  |